Skip to content

Early yield on 429 throttling on barrier requests#48914

Open
mbhaskar wants to merge 14 commits into
Azure:mainfrom
mbhaskar:early-yield-on-429-throttling
Open

Early yield on 429 throttling on barrier requests#48914
mbhaskar wants to merge 14 commits into
Azure:mainfrom
mbhaskar:early-yield-on-429-throttling

Conversation

@mbhaskar

@mbhaskar mbhaskar commented Apr 23, 2026

Copy link
Copy Markdown
Member

Description

This PR introduces early yield on 429 (throttling) responses during barrier requests for strong/bounded-staleness consistency.

When receiving 429s under strong consistency, the quorum reader/writer barrier code does not yield early enough. Instead it keeps retrying the barrier until the overall request timeout, producing multiple stack traces and accumulating resource pressure on the client. With this change, when all barrier responses are throttled (429), the SDK yields early rather than spinning until timeout:

  • Reads (QuorumReader): when the barrier responses are all 429s, the original 429 is surfaced/yielded early instead of being retried to timeout.
  • Writes (ConsistencyWriter): when the write barrier is throttled, a 408 (RequestTimeout) with substatus 21013 (SERVER_WRITE_BARRIER_THROTTLED) is returned early.

Feature flag (gated, default OFF)

This behavior is disabled by default and gated behind a static Configs feature flag, mirroring the updated .NET implementation which placed the same behavior behind a feature flag. The Java public SDK has no per-tenant config path, so a static Configs flag is used following existing patterns (e.g. isSessionTokenFalseProgressMergeEnabled()):

  • System property: COSMOS.ENABLE_BARRIER_EARLY_YIELD_ON_429
  • Environment variable: COSMOS_ENABLE_BARRIER_EARLY_YIELD_ON_429
  • Default: false
  • Accessor: Configs.isBarrierEarlyYieldOn429Enabled()

When the flag is off, behavior is identical to before this PR. Diagnostic detection logs are emitted unconditionally for livesite visibility; only the behavioral side effects (early-yield return, 408 throw, throttle tracking) and the "yielding early" action log are gated behind the flag.

Reference

Aligns with the updated .NET implementation, which puts this behavior behind a feature flag:
https://msdata.visualstudio.com/CosmosDB/_git/CosmosDB/pullrequest/2117699

Changes

  • Configs.java: new flag constants + isBarrierEarlyYieldOn429Enabled() accessor.
  • QuorumReader.java: gate the three read barrier early-yield paths behind the flag.
  • ConsistencyWriter.java: gate the write barrier 408/21013 early-yield and throttle tracking behind the flag.
  • HttpConstants.java: add substatus SERVER_WRITE_BARRIER_THROTTLED = 21013.
  • StoreResult.java: add isThrottledException helper.
  • Unit tests (QuorumReaderTest, ConsistencyWriterTest) for both flag-on and flag-off behavior, plus fault-injection E2E tests.
  • CHANGELOG updated.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

Copilot AI review requested due to automatic review settings April 23, 2026 18:04
@mbhaskar mbhaskar requested review from a team and kirankumarkolli as code owners April 23, 2026 18:04
@mbhaskar mbhaskar changed the title Early yield on 429 throttling Early yield on 429 throttling on barrier requests Apr 23, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Cosmos direct connectivity quorum/barrier logic to “yield early” when replica reads are uniformly throttled (HTTP 429), allowing the existing ResourceThrottleRetryPolicy to apply appropriate backoff instead of progressing into additional quorum/primary/barrier attempts.

Changes:

  • Add StoreResult.isThrottledException to cheaply detect 429 responses.
  • In QuorumReader, propagate 429 immediately when all collected replica results are throttled (including barrier paths).
  • In ConsistencyWriter, track throttling during write barriers and, when retries are exhausted and the last attempt was fully throttled, throw a RequestTimeoutException with a new substatus code; add unit tests for the new behaviors.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/StoreResult.java Adds a computed flag to identify throttling (429) on replica results.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/QuorumReader.java Early-yields on replica-wide throttling to let throttle retry policy handle backoff.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/directconnectivity/ConsistencyWriter.java Tracks throttling during write barriers and surfaces a distinct timeout substatus when retries are exhausted.
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/HttpConstants.java Introduces a new substatus code for write-barrier throttling exhaustion.
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/directconnectivity/QuorumReaderTest.java Adds unit tests covering 429 propagation and Gone+429 interactions.
sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/directconnectivity/ConsistencyWriterTest.java Adds unit tests for write-barrier behavior under sustained throttling and mixed outcomes.

@xinlian12

Copy link
Copy Markdown
Member

@sdkReviewAgent

@xinlian12

Copy link
Copy Markdown
Member

Review complete (49:16)

Posted 1 inline comment(s).

Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage

@mbhaskar

Copy link
Copy Markdown
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Copy link
Copy Markdown
No pipelines are associated with this pull request.

@mbhaskar

Copy link
Copy Markdown
Member Author

/azp run java - cosmos - tests

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

mbhaskar and others added 6 commits May 27, 2026 14:57
Port of .NET PR #1667829: When receiving repeated 429 (Too Many Requests)
responses with strong consistency, QuorumReader and ConsistencyWriter now
handle throttling more efficiently.

QuorumReader (reads):
- waitForReadBarrierAsync: yield early when all replicas return 429 in both
  single-region and multi-region barrier loops
- ensureQuorumSelectedStoreResponse: yield early when all replicas throttled
  during initial quorum read
- All cases throw the 429 exception to let ResourceThrottleRetryPolicy
  handle retry with appropriate backoff

ConsistencyWriter (writes):
- waitForWriteBarrierAsync: track lastAttemptWasThrottled flag per iteration
- Do NOT yield early (preserves idempotency guarantees)
- When all retries exhausted due to consistent throttling, throw
  RequestTimeoutException (408) with substatus SERVER_WRITE_BARRIER_THROTTLED
  (21013) instead of returning barrier-not-met

Other changes:
- Added isThrottledException field to StoreResult
- Added SERVER_WRITE_BARRIER_THROTTLED (21013) substatus code
- Unit tests for all throttling scenarios

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ed replica yields early

Port of .NET test ValidatesReadMultipleReplicaAsyncExcludesGoneReplicas.
Validates that when replicas return a mix of 410 (Gone) and 429 (TooManyRequests):
- Gone replicas are excluded from results by StoreReader (isValid=false for GONE)
- The 429 replica with valid LSN headers is kept (isValid=true for non-GONE with lsn>=0)
- Since all remaining replicas are throttled, early yield triggers
- The 429 exception propagates to ResourceThrottleRetryPolicy

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes from xinlian12 (blocking):
- Fix lastAttemptWasThrottled stale state: reset flag before
  avoidQuorumSelection early return to prevent incorrect 408 when
  prior iteration was throttled but current iteration hits 410
- Fix readStrong_AllReplicasThrottled_Returns429 false positive: set LSN
  on exception so StoreReader marks isValid=true, ensuring the early yield
  path is actually exercised. Add transport invocation count assertions
  to verify primary read is NOT attempted.
- Add readStrong_BarrierRequestsThrottled_Returns429 test covering the
  waitForReadBarrierAsync barrier path (quorum succeeds, then barrier
  HEAD requests return 429)

Fixes from Copilot review:
- Fix checkstyle: add missing spaces around = operator (2 places)
- Fix log wording: 'All replicas' -> 'All contacted replicas' (more
  accurate since not all replicas may be contacted per attempt)
- Fix ConsistencyWriter log: 'consistent throttling' -> 'last attempt
  was throttled' (flag only tracks last attempt, not all attempts)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tling

Unit tests (4 new, 10 total throttling tests):
- writeBarrier_AvoidQuorumSelectionAfterThrottling_NoFalse408: validates
  lastAttemptWasThrottled reset on avoidQuorumSelection path (stale state fix)
- writeBarrier_NRegionCommit_AllReplicasThrottled_Returns408: N-region
  synchronous commit barrier throttling produces 408/21013
- readStrong_QuorumNotSelected_PrimaryThrottled_Returns429: primary 429
  propagates correctly through QuorumNotSelected → readPrimary path
- readStrong_BarrierPartialThrottle_StillSucceeds: barrier succeeds when
  one replica is throttled but other meets LSN (no false-negative yield)

Fault injection E2E tests (3 new, require strong consistency account):
- faultInjection_readBarrierThrottled_yieldsEarly: inject 429 on
  HEAD_COLLECTION + GCLSN interceptor → verify early yield on reads
- faultInjection_writeBarrierThrottled_returns408: inject 429 on
  HEAD_COLLECTION + GCLSN interceptor → verify 408 on writes
- faultInjection_readBarrierThrottled_thenRecovers: inject 429 with
  hitLimit(2) → verify read succeeds after throttle clears

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Read and write barrier requests are only triggered on multi-region strong
consistency accounts (numberOfReadRegions > 0). The emulator is single-region,
so the GCLSN interceptor never triggers barriers and the tests fail with
empty supplementalResponseStatisticsList.

Added accountLevelReadRegions.size() > 1 skip check to all three E2E fault
injection tests so they correctly skip on single-region environments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mbhaskar mbhaskar force-pushed the early-yield-on-429-throttling branch from 28b1fae to 3cd6163 Compare May 27, 2026 21:59
mbhaskar and others added 3 commits June 3, 2026 22:58
…_ON_429 flag

Align with the updated .NET implementation (PR 2117699), which puts the
early-yield-on-429 barrier behavior behind a feature flag defaulting to off.

- Add Configs.isBarrierEarlyYieldOn429Enabled() following the existing
  COSMOS.* / COSMOS_* system-property/env-var pattern, default false.
- QuorumReader: keep the 'all replicas returned 429' diagnostic logs
  unconditional, but gate the three early-yield Mono/Flux.error returns
  behind the flag.
- ConsistencyWriter: gate lastAttemptWasThrottled tracking and the 408
  (substatus 21013) throw behind the flag; logs remain unconditional.
- Tests: enable the flag for the barrier-throttle unit/E2E tests and add a
  flag-off default test; restore the system property after each test.
- CHANGELOG: note the behavior is now opt-in and disabled by default.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- FaultInjectionServerErrorRuleOnDirectTests: the write-barrier E2E test
  now asserts the thrown CosmosException is 408 with substatus 21013
  (SERVER_WRITE_BARRIER_THROTTLED) instead of only checking the fault rule
  hit count, so it actually validates the write-side behavior. Also move
  the System.setProperty for the feature flag inside the try block so the
  finally always clears it even if fault-rule setup throws.
- QuorumReader: the unconditional 'all replicas returned 429' detection log
  no longer claims 'Yielding early'; the yield action is now logged inside
  the feature-flag guard so the log isn't misleading when the flag is off.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…throttling

# Conflicts:
#	sdk/cosmos/azure-cosmos/CHANGELOG.md
- StoreResult: exclude throughput-control 429s from isThrottledException so
  client-side throttling does not falsely trigger early yield
- ConsistencyWriter: simplify lastAttemptWasThrottled assignment to
  set(enableBarrierEarlyYieldOn429)
- E2E write test: fail when createItem unexpectedly succeeds
- Add E2E test for default flag-off behavior asserting the 21013 early-yield
  substatus is not produced

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@kushagraThapar kushagraThapar left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mbhaskar for tackling this — the throttle amplification on barrier loops is real, and gating it behind a default-OFF feature flag (with the .NET parity story for future enablement) is exactly the right shape. The Copilot / xinlian12 / Praveen-Msft round-trips landed clean fixes; the notes below are the gaps I see after that.

Inline notes (7):

  • M1 — Building on xinlian12's :520 thread: 408/21013 currently routes through ClientRetryPolicy.shouldRetryOnRequestTimeout to PPAF endpoint-marking, not to ResourceThrottleRetryPolicy. Means a throttle storm becomes a region failover, which doesn't fix RU exhaustion at the source.
  • M2 — The synthesized RequestTimeoutException(message, URI, subStatusCode) overload drops both the underlying 429 cause AND the server's x-ms-retry-after-ms header. Capturing into an AtomicReference<CosmosException> parallel to lastAttemptWasThrottled would preserve both.
  • M3 — Write-side terminal yield may preempt dynamic-quorum convergence — waitForWriteBarrierAsync previously polled until replicas committed; now we yield terminally when one replica is hot. Worth pinging the replication owners on the Strong-vs-Eventual semantics (chapter 05 #write-path-consistencywriter-internals).
  • M4 — Building on xinlian12's :520 thread: 21013 has a 3-part documentation cliff — StatusCodes.md row missing, CHANGELOG bullet doesn't mention the new substatus value, chapter 17 § 5.1 needs the 408-substatus entry.
  • M5 — Three near-identical 429-allMatch early-yield blocks in QuorumReader (:443, :713, :810). Drift risk especially since xinlian12's still-open retryAfterInMs thread will need to touch all three.
  • M6 — The 20-arg positional new StoreResult(...) ctor is duplicated across 5+ new test methods in ConsistencyWriterTest (plus more in QuorumReaderTest). This is the third PR in recent memory adding to that pattern — would love a StoreResultTestFactory extraction as a follow-up.
  • m16SERVER_WRITE_BARRIER_THROTTLED naming suggests server-origin but is client-synthesized; sibling precedent is *_EXCEEDED_RETRY_LIMIT. Cheaper to rename pre-merge than post-release.

Cross-SDK 21013 dispute resolution (re xinlian12's note at HttpConstants.java:504):

xinlian12 asked whether SERVER_WRITE_BARRIER_THROTTLED = 21013 exists in .NET; the reply indicated .NET does have something equivalent. I tried to verify this end-to-end against public sources:

  • Azure/azure-cosmos-dotnet-v3 main (SHA e45ca1976cea0986455403422ce0e7380157d613):
    • git grep -wn "21013" -- '*.cs' → zero source hits (only test JSON false positives)
    • grep -rn "BarrierThrottled\|WriteBarrierThrottled" --include="*.cs" → zero hits
    • ConfigurationManager.cs (~30 AZURE_COSMOS_* env vars) → zero hits for any barrier / early-yield / 429 flag

So in the public .NET v3 surface, the feature isn't present. The reply is most likely referring to the closed Microsoft.Azure.Cosmos.Direct NuGet (which hosts ConsistencyWriter for .NET) or to internal MSDATA PR #2117699 — both of which I can't see from here.

The reason this matters: 21013 is client-synthesized by Java, not server-emitted. If MSDATA picks a different numeric value (or a different substatus name) when they add the same feature, cross-SDK diagnostics keyed on substatus (e.g. customer dashboards correlating .NET vs Java telemetry) will permanently diverge. Suggesting a quick sync with the .NET team to lock the numeric value + canonical name before this merges. Happy to be wrong if there's already cross-SDK alignment I'm not seeing.

Design-doc follow-ups (separate PR against tvaron3/cosmosdb-design-docs):

  • D1 — chapter 17 17-status-codes-and-sdk-retries.md § 5.1 "408 substatus reference" — add the 21010 / 21011 / 21012 / 21013 row with the local-commit-state column + retry-policy mapping
  • D2 — chapter 05 05-replication-consistency.md — add a "Barrier HEAD request throttling" subsection covering both the read early-yield (returning the underlying 429 to ResourceThrottleRetryPolicy) and the write terminal 408/21013 asymmetry
  • D3 — chapter 11 11-request-units-throttling.md — cross-ref note that barrier HEAD requests share the per-physical RU bucket with data-plane reads, so a single hot logical partition can throttle barriers indefinitely

Also FYI — PR currently shows mergeable: CONFLICTING so a rebase is needed regardless; might be cheap to fold the above into that rebase pass.

Approving with notes — none of this is blocking the early-yield mechanic itself, which I think is the right architectural call given the flag-gating + .NET parity story.

}

if (writeBarrierRetryCount.get() == 0) {
if (this.enableBarrierEarlyYieldOn429 && lastAttemptWasThrottled.get()) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I'd love your read on (and probably worth pinging the replication owners):

The write-side waitForWriteBarrierAsync barrier loop polls replicas waiting for globalCommittedLsn to catch up — i.e. it's the dynamic-quorum convergence loop. If one replica's RU bucket is hot enough that it returns 429 on every barrier poll, but the OTHER replicas eventually catch up and commit, the prior behavior (return Mono.just(false) after retries exhausted → caller falls back to recomputing quorum) was effectively eventual-consistency-friendly. The new behavior (throw RequestTimeoutException(21013) on the same condition) is terminal — we never get to see whether the slow-but-not-throttled replicas converged.

For Strong / BoundedStaleness this might actually be the right call (we'd rather surface a clear timeout than serve a stale read). For Eventual / Session it feels slightly aggressive. Wondering if the design intent is to gate this on consistency level, or whether the chapter 05 05-replication-consistency.md#write-path-consistencywriter-internals contract explicitly permits the terminal yield regardless. Could be worth a comment thread with the replication owners before this merges.

public static final int SERVER_GENERATED_408 = 21010;
public static final int FAILED_TO_PARSE_SERVER_RESPONSE = 21011;
public static final int GLOBAL_N_REGION_COMMIT_WRITE_BARRIER_NOT_MET = 21012;
public static final int SERVER_WRITE_BARRIER_THROTTLED = 21013;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building on xinlian12's thread at HttpConstants.java:504:

Even setting aside the cross-SDK numeric coordination question (covered in the summary), 21013 as a customer-observable substatus has a documentation cliff:

  1. sdk/cosmos/azure-cosmos/docs/StatusCodes.md lists every other 408 substatus (21010 SERVER_GENERATED_408, 21011 FAILED_TO_PARSE_SERVER_RESPONSE, 21012 GLOBAL_N_REGION_COMMIT_WRITE_BARRIER_NOT_MET) with semantics + retry guidance. 21013 needs a parallel row so customers who see "408 substatus 21013" in their telemetry can self-serve diagnose.
  2. The CHANGELOG bullet currently reads "Added early yield on 429 throttling on barrier requests" — it doesn't mention the new substatus value or what it means for downstream alerting / dashboards. Customers with substatus-keyed monitoring will see a brand-new 408 substatus appear without warning.
  3. Design-doc follow-up (separate PR): chapter 17 17-status-codes-and-sdk-retries.md has section 5.1 listing 408 substatus codes — 21013 should be added there with the retry-policy mapping.

verified by: grep -n "21013\|SERVER_WRITE_BARRIER_THROTTLED" sdk/cosmos/azure-cosmos/docs/StatusCodes.md → no output (confirmed undocumented)

// Check if all contacted replicas returned 429 Too Many Requests.
// Yield early to let ResourceThrottleRetryPolicy handle the retry with appropriate backoff,
// instead of returning QuorumNotSelected which would trigger an unnecessary primary read attempt.
if (!responseResult.isEmpty() && responseResult.stream().allMatch(r -> r.isThrottledException)) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isEmpty() && allMatch(isThrottledException) predicate is duplicated at three sites in this file — here at :443, again at :713 (waitForReadBarrierAsync first throttle gate), and at :810 (waitForReadBarrierAsync second throttle gate). The three blocks are structurally identical except for the log strings and the early-return target.

Two reasons to extract a helper before merge:

  1. Pure drift risk — when xinlian12's still-open retryAfterInMs thread (currently on :445) gets addressed, the fix needs to land in all three sites consistently. With the current copy-paste shape it's easy to miss one.
  2. The early-yield decision is the new authoritative invariant of this PR. Having it expressed once (something like private Mono<X> tryEarlyYieldOn429(List<StoreResult>, ...) returning Optional.empty() when not all-throttled) makes the policy testable in isolation and removes the chance of three sites diverging behaviorally.

Would also let the unit-test footprint shrink — currently each new test exercises one of the three sites in a near-identical way.

…throttling

# Conflicts:
#	sdk/cosmos/azure-cosmos/CHANGELOG.md
mbhaskar and others added 3 commits June 11, 2026 11:18
…ottled 408

A 408 synthesized from write-barrier throttling (substatus
SERVER_WRITE_BARRIER_THROTTLED) is an RU hot-spot signal, not an endpoint
failure. Guard shouldRetryOnRequestTimeout() to surface it as terminal
(no retry) without marking the endpoint unavailable or triggering
per-partition automatic failover. Addresses review concern #1.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the write-barrier loop exhausts retries under consistent throttling
and synthesizes a 408 (substatus SERVER_WRITE_BARRIER_THROTTLED), capture
the last throttled 429 in an AtomicReference and pass it as the inner
exception, and propagate its server-advised x-ms-retry-after-ms header.
This keeps the diagnostics cause-chain intact and gives a downstream retry
policy honoring the 408 the server hint to work with. Addresses review
concern Azure#2.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Flip COSMOS.ENABLE_BARRIER_EARLY_YIELD_ON_429 default from false to true
to match the .NET SDK (default-on with opt-out). Update CHANGELOG and
test comments, and switch the opt-out E2E test to explicitly set the
flag to false (clearing the property now means enabled).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mbhaskar

Copy link
Copy Markdown
Member Author

@copilot resolve the merge conflicts in this pull request

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants